NBA Career Prediction Experiment

Aim:

To revisit my earlier Random Forest and attempt to improve it by adding Gaussian Mixture clusters as a feature, tuning with HyperOpt, and examining partial dependence and LIME explanations.

To improve on the team's best result so far, a polynomial logistic regression using the feature set ['GP', 'MIN', 'FG%', '3P Made', '3P%', 'FTM', 'FT%', 'OREB', 'DREB', 'AST', 'STL', 'BLK', 'TOV'].

Findings:

Results (05)

Results (05a) with Gaussian Mixture Clusters added as a feature:

Set up

Data

Decisions

We will retain all potential features when using non-linear models.

We will replace negative values with their absolute values; the magnitudes appear sensible apart from the sign, and over 3,000 observations contain one or more negative values.

TARGET_5Yrs is our target.

Cleaning
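A minimal sketch of the negative-value decision above, on a synthetic frame (the column names are placeholders for the real stat columns):

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset; two rows carry a negative.
df = pd.DataFrame({"MIN": [10.5, -12.3, 30.1], "AST": [1.2, 0.5, -2.0]})

numeric_cols = df.select_dtypes("number").columns
n_affected = int((df[numeric_cols] < 0).any(axis=1).sum())  # rows with any negative
df[numeric_cols] = df[numeric_cols].abs()                   # flip sign, keep magnitude
```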

Scaling & splitting of data
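A sketch of the scale-and-split step, with synthetic stand-in data. Splitting before fitting the scaler keeps validation rows out of the scaler's statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # placeholder feature matrix
y = rng.integers(0, 2, size=100)  # placeholder TARGET_5Yrs labels

# Split first, then fit the scaler on the training fold only (no leakage).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
```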

Feature engineering - Gaussian clusters

Decision about clusters:

Include the cluster assignment as a new feature.
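The cluster-as-feature decision can be sketched as below (synthetic data in place of the scaled training features; the choice of 3 components is illustrative, not the tuned value):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # placeholder for the scaled feature matrix

# Fit a Gaussian Mixture, then append the hard cluster label as an extra column.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
clusters = gmm.predict(X)
X_aug = np.column_stack([X, clusters])
```

At prediction time the same fitted `gmm` assigns clusters to unseen rows, so the feature stays consistent between train and test.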

Modelling

Model tuning

Fit best model

Evaluation

Variable importance by permutation
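A sketch of permutation importance with scikit-learn, on synthetic data in place of the fitted model and validation set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = RandomForestClassifier(random_state=42).fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in score.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```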

Partial dependence plot

Explain specific observations with LIME

Hypothesis to test: the feature "GP" may be driving many false negatives, so would the model improve without it?
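One way to test this hypothesis is a cross-validated recall comparison with and without the suspect column. A sketch on synthetic data (the GP column index is hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
gp_idx = 0  # hypothetical column index of "GP"

# Recall targets the false negatives the hypothesis is about.
model = RandomForestClassifier(random_state=42)
with_gp = cross_val_score(model, X, y, cv=3, scoring="recall").mean()
without_gp = cross_val_score(
    model, np.delete(X, gp_idx, axis=1), y, cv=3, scoring="recall"
).mean()
```

If `without_gp` beats `with_gp` consistently across folds, that supports dropping GP.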

Apply to test data for submission